TABLE OF CONTENTS


SETUP

PROBLEM STATEMENT

Welcome to the Machine Learning Housing Corporation! The first task you are asked to perform is to build a model of housing prices in California using the California census data.

The model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

DATASET

We use the California Housing Prices dataset from the StatLib repository. This dataset is based on data from the 1990 California census and has metrics such as the population, median income, median housing price, and so on for each block group in California.

Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.

The median income attribute does not look like it is expressed in US dollars (USD).

The attributes have very different scales.

SEPARATE THE TEST SET

Splitting manually

Splitting using StratifiedShuffleSplit

Check the test set distribution compared to the whole dataset

I compared the income category proportions in the overall dataset, in the test set generated with stratified sampling, and in a test set generated using purely random sampling.

The test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
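The stratified split and the proportion check above can be sketched as follows. This is a minimal, self-contained example: it uses a synthetic income column as a stand-in for the real housing data (an assumption; the actual notebook loads the California Housing CSV), and the bin edges for the income categories are the commonly used ones, not taken from this document.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (assumption: the real notebook
# loads the California Housing CSV instead).
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.gamma(shape=2.0, scale=1.9,
                                                   size=10_000)})

# Bucket incomes into categories before stratifying on them.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# Stratified split: preserves income-category proportions in the test set.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(housing, housing["income_cat"]))
strat_test = housing.iloc[test_idx]

# Purely random split, for comparison.
_, rand_test = train_test_split(housing, test_size=0.2, random_state=42)

# Compare income-category proportions across the three sets.
comparison = pd.DataFrame({
    "overall": housing["income_cat"].value_counts(normalize=True).sort_index(),
    "stratified": strat_test["income_cat"].value_counts(normalize=True).sort_index(),
    "random": rand_test["income_cat"].value_counts(normalize=True).sort_index(),
})
print(comparison)
```

The stratified column tracks the overall proportions almost exactly, while the random column drifts, which is the skew described above.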

DATA EXPLORATION

Visualizing Geographical Data

Check Correlation
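A correlation check can be sketched with pandas as below. The data here is synthetic (an assumption: income is made to drive house value, plus noise) so that the ranking of correlations is visible without the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): house value depends on income, plus noise.
rng = np.random.default_rng(0)
income = rng.uniform(0.5, 15.0, 5_000)
housing = pd.DataFrame({
    "median_income": income,
    "median_house_value": 40_000 * income + rng.normal(0, 50_000, 5_000),
    "total_rooms": rng.integers(100, 6_000, 5_000).astype(float),
})

# Pearson correlation of every numeric attribute with the target,
# sorted from most to least correlated.
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```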

PREPARING DATA FOR ML ALGORITHMS

Handling nulls option 1 : drop nulls

Handling nulls option 2 : fill nulls by median

Handling nulls option 3 : fill nulls by median using Scikit-Learn

Handling Text and Categorical Attributes

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. This may be fine in some cases (e.g., for ordered categories such as “bad”, “average”, “good”, “excellent”), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1).
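The usual fix is one-hot encoding: each category becomes its own binary column, so no two categories look "closer" than any other pair. A minimal sketch with a few made-up ocean_proximity values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat = pd.DataFrame({"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY",
                                        "INLAND", "NEAR OCEAN"]})

# One-hot encode; fit_transform returns a sparse matrix by default,
# so convert to a dense array for display.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cat).toarray()
print(encoder.categories_)
print(one_hot)
```

Each row now has exactly one 1 (in the column of its category) and 0s elsewhere.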

Customized attributes

Create Pipeline
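The preparation steps above (median imputation, scaling for the very different attribute scales, one-hot encoding) can be chained into one pipeline. This is a sketch on a tiny made-up frame; the column names mirror the housing attributes but the values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

housing = pd.DataFrame({
    "median_income": [2.5, np.nan, 8.3, 4.1],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})
num_attribs = ["median_income", "total_rooms"]
cat_attribs = ["ocean_proximity"]

# Numeric branch: fill nulls with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Full pipeline: numeric branch for numeric columns, one-hot for categories.
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

prepared = full_pipeline.fit_transform(housing)
print(prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

The same fitted pipeline can later transform the test set with the medians, scales, and categories learned from the training set.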

TRAINING AND EVALUATING

Linear Regression Model

A typical prediction error of $68,637.14 is not very satisfying. This is an example of a model underfitting the training data.
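Computing that training-set RMSE can be sketched as below. The data is synthetic (an assumption: house value is made a nonlinear function of income, so a straight line underfits and leaves a large typical error even on the data it was trained on).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption): value grows nonlinearly with income,
# so a linear model cannot capture it.
rng = np.random.default_rng(1)
X = rng.uniform(0.5, 15.0, (2_000, 1))
y = 20_000 * X[:, 0] ** 1.7 + rng.normal(0, 10_000, 2_000)

lin_reg = LinearRegression().fit(X, y)
predictions = lin_reg.predict(X)

# RMSE on the training set itself; a large value here signals underfitting.
lin_rmse = np.sqrt(mean_squared_error(y, predictions))
print(round(lin_rmse))
```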

Decision Tree Model

No error at all?! It is much more likely that the model has badly overfit the data. You don’t want to touch the test set until you are ready to launch a model you are confident about, so you need to use part of the training set for training, and part for model validation.
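The zero-training-error symptom is easy to reproduce: an unconstrained decision tree with unique feature values memorizes every training point. A sketch on synthetic data (an assumption; the real notebook uses the prepared housing features):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Simple noisy linear data; each x value is unique, so an unrestricted
# tree can carve one leaf per training point.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, (500, 1))
y = 3 * X[:, 0] + rng.normal(0, 1.0, 500)

tree = DecisionTreeRegressor(random_state=42).fit(X, y)
tree_rmse = np.sqrt(mean_squared_error(y, tree.predict(X)))
print(tree_rmse)  # ~0: the tree has memorized the training set
```

A perfect training score like this says nothing about how the model will do on unseen data, which is why validation comes next.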

Random Forest Model

Support Vector Machine Model

FINE TUNING

Evaluate models using Cross-Validation
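K-fold cross-validation exposes the overfitting that the training score hides. A sketch on synthetic data (an assumption): the overfit tree's per-fold RMSE comes out worse than plain linear regression, which is the pattern described below. Note that Scikit-Learn's scorers maximize a score, so MSE is reported negated and must be flipped before taking the root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Noisy linear data (assumption): the true relationship is linear,
# so the tree's memorized noise hurts it on held-out folds.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (1_000, 1))
y = 3 * X[:, 0] + rng.normal(0, 2.0, 1_000)

def rmse_scores(model):
    # cross_val_score returns negative MSE per fold; negate, then sqrt.
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=10)
    return np.sqrt(-scores)

tree_rmse = rmse_scores(DecisionTreeRegressor(random_state=42))
lin_rmse = rmse_scores(LinearRegression())
print("tree mean RMSE:", tree_rmse.mean())
print("linear mean RMSE:", lin_rmse.mean())
```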

The Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.

Random Forests look very promising. However, the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set.

Analyze the Best Models and Their Errors
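One way to analyze the best model is to inspect the Random Forest's feature importances and rank the attributes; low-importance ones are candidates for dropping. A sketch on synthetic data, where the attribute names are hypothetical stand-ins for the housing columns and the first feature is constructed to matter most:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): feature 0 dominates, feature 1 matters a
# little, feature 2 is pure noise.
rng = np.random.default_rng(4)
X = rng.normal(size=(1_000, 3))
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 1_000)
names = ["median_income", "total_rooms", "households"]  # hypothetical names

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Rank attributes by importance, highest first.
for score, name in sorted(zip(forest.feature_importances_, names),
                          reverse=True):
    print(f"{name}: {score:.3f}")
```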

TESTING BEST MODEL

FINAL PIPELINE WITH TESTING